Abstract

PageRank is a way to rank Web pages that takes into account the hyper-link
structure of the Web. It provides a simple and efficient method for ranking
Web pages, but it produces only an approximation of the ranking, since the
random surfer model uses uniform distributions for every choice made during
the surfing process. In particular, this implies that the random surfer has no
preferences, an assumption that is limiting by its nature. Personalized
PageRank was designed to solve this problem, but it is still quite restrictive:
it assumes non-uniform preferences only when jumping to an arbitrary page on
the Web and preference-free behaviour when following outgoing hyper-links.
Taking these limitations of PageRank and Personalized PageRank into account,
we propose Weighted PageRank, in which we are free to weight hyper-links
according to any possible preference behaviour of a user. In particular,
cluster-related weights are considered.
1 Introduction and Methodology


1.1  PageRank  and  Weighted  PageRank 


PageRank is a way to rank Web pages that takes into account the hyper-link
structure of the Web [2]. The easiest way to picture the "physical" meaning of
PageRank is the random surfer model: a surfer who explores the Web in a random
way. Having arrived at a page, the surfer either jumps, with some probability,
to an arbitrary page chosen uniformly from the set of all pages on the Web or,
with the complementary probability, follows one of the outgoing hyper-links of
the page, chosen uniformly from the set of all outgoing hyper-links of the
page. The random surfer model forms an ergodic Markov chain with a unique
stationary probability distribution. The entry of the stationary distribution
corresponding to a Web page can be interpreted as the probability of finding
the surfer at that page and, hence, as a kind of popularity or authority
measure of the page. PageRank provides a nice and simple method for ranking
Web pages from the hyper-link structure of the Web, but it produces only an
approximation of the ranking, since the random surfer model uses uniform
distributions for every choice made during the surfing process, implying that
the random surfer has no preferences. This assumption is limiting by its
nature. Personalized PageRank was introduced to solve the problem, but it is
still quite restrictive, since it assumes preferences only when jumping to an
arbitrary page on the Web and preference-free behaviour when following
outgoing hyper-links. Taking these limitations of PageRank and Personalized
PageRank into account, we propose Weighted PageRank, in which we are free to
weight hyper-links according to any possible preference behaviour of a user.

The most general definition of Weighted PageRank we can imagine is the
following. Let us denote by $n$ the number of pages on the Web. Let
$\mathcal{P} = \{1, 2, \ldots, n\}$ be the set of all Web pages. We use the
subscripts $i, j$ to enumerate the Web pages. Let us assume that we have a
countably infinite number of link types. We denote the set of link types by
$\mathcal{T}$, where in fact $\mathcal{T} = \mathbb{N}$. We choose a weight
for each link type and denote the weights by $\{\alpha_k\}_{k=1}^{\infty}$.
We use the subscript $k$ to enumerate link types. The link type weights meet
the following conditions:


$$\alpha_k \ge 0, \quad \forall k \in \mathcal{T}, \qquad (1)$$

$$\sum_{k=1}^{\infty} \alpha_k = 1. \qquad (2)$$

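Let us denote by $\mathcal{P}_i$ the set of outgoing links of page $i$,
$i \in \mathcal{P}$. Each outgoing link is identified with the page it points
to, so it is evident that $\mathcal{P}_i \subset \mathcal{P}$. Let us denote
by $n_i$ the number of outgoing links of page $i$, $|\mathcal{P}_i| = n_i$.
Let us denote by $\mathcal{P}_i^k$ the set of outgoing links of page $i$ of
type $k \in \mathcal{T}$. We note that a link can be polytypic, that is, it
can belong to several $\mathcal{P}_i^k$. We assume here that a link is at
least monotypic. We note that
$\bigcup_{k=1}^{\infty} \mathcal{P}_i^k = \mathcal{P}_i$ and that most of the
$\mathcal{P}_i^k$ are empty. Let us denote by $n_i^k$ the number of outgoing
links of page $i$ of type $k \in \mathcal{T}$, $|\mathcal{P}_i^k| = n_i^k$.
Let us define a map $t_i(j)$ from the outgoing links of page $i$ to their
types:

$$t_i(j) : \mathcal{P}_i \to 2^{\mathcal{T}} \setminus \{\emptyset\}.$$

$t_i(j)$ is the list of types of the link $(i, j)$. Let us denote by
$\mathcal{T}_i$ the set of link types that the outgoing links of page $i$
have:

$$\mathcal{T}_i = \bigcup_{j \in \mathcal{P}_i} t_i(j).$$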
We note that $\mathcal{T}_i \subseteq \mathcal{T}$.


Now that we have defined a number of auxiliary notions, let us define the
weighted matrix $W$, which plays the same role for Weighted PageRank as the
Google matrix plays for PageRank.


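$$
\begin{aligned}
W_{ij} &= \sum_{m \in t_i(j)} \frac{\alpha_m}{n_i^m}
        + \frac{1}{n_i} \sum_{k \notin \mathcal{T}_i} \alpha_k,
        && j \in \mathcal{P}_i, \\
W_{ij} &= 0, && j \notin \mathcal{P}_i.
\end{aligned}
\qquad (3)
$$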


For every type it carries, a link of a page receives an equal share of the
weight of that link type, plus an equal share of the total weight of the link
types that are not present at the page.
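The construction of $W$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `weighted_matrix`, the dictionary-based input format and the toy graph are hypothetical and do not come from the paper, and the weights are restricted to finitely many types (all other types have weight zero). Each link $(i, j)$ receives $\alpha_m / n_i^m$ for every type $m$ it carries, plus a $1/n_i$ share of the weight of the types absent at page $i$.

```python
from collections import defaultdict


def weighted_matrix(n, link_types, alpha):
    """Build the n x n weighted matrix W for pages numbered 0..n-1.

    link_types -- dict mapping a link (i, j) to the non-empty set of its types
    alpha      -- dict mapping a link type to its weight (weights sum to 1)
    Rows of dangling pages are left as zeros.
    """
    # Group the outgoing links of every page: out[i][j] = types of link (i, j).
    out = defaultdict(dict)
    for (i, j), types in link_types.items():
        if not types:
            raise ValueError("every link must have at least one type")
        out[i][j] = set(types)

    W = [[0.0] * n for _ in range(n)]
    for i, targets in out.items():
        n_i = len(targets)
        # count[k] plays the role of n_i^k: links of page i that carry type k.
        count = defaultdict(int)
        for types in targets.values():
            for k in types:
                count[k] += 1
        # Total weight of the link types that do not occur at page i.
        missing = sum(w for k, w in alpha.items() if k not in count)
        for j, types in targets.items():
            own = sum(alpha.get(m, 0.0) / count[m] for m in types)
            W[i][j] = own + missing / n_i
    return W


# Hypothetical toy graph: page 0 links to pages 1, 2, 3; the others link back.
alpha = {1: 0.5, 2: 0.3, 3: 0.2}
link_types = {
    (0, 1): {1}, (0, 2): {1, 2}, (0, 3): {2},
    (1, 0): {3}, (2, 0): {1}, (3, 0): {2},
}
W = weighted_matrix(4, link_types, alpha)
row_sums = [sum(row) for row in W]
```

In this toy graph the polytypic link (0, 2) collects a share of both type weights, and every row of a non-dangling page sums to one.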

Let us define one more notion: the set of links of page $i$ marked by their
types.

$$M_i = \{(j, m) : j \in \mathcal{P}_i,\ m \in t_i(j)\}. \qquad (4)$$


Proposition 1 If the original graph is strongly connected, which corresponds
to an ergodic Markov chain, then $W$ is a stochastic matrix.

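Proof We shall prove that

$$\sum_{j=1}^{n} W_{ij} = 1, \quad \forall i \in \mathcal{P}.$$

Since the original graph is strongly connected, there are no dangling pages.
Let us prove the statement for a non-dangling page $i$,
$\mathcal{P}_i \ne \emptyset$.

$$\sum_{j=1}^{n} W_{ij} = \sum_{j \in \mathcal{P}_i} W_{ij} + \sum_{j \notin \mathcal{P}_i} W_{ij}. \qquad (5)$$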


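The second term is zero. Let us analyze the first term.

$$
\begin{aligned}
\sum_{j \in \mathcal{P}_i} W_{ij}
&= \sum_{j \in \mathcal{P}_i} \left( \sum_{m \in t_i(j)} \frac{\alpha_m}{n_i^m}
   + \frac{1}{n_i} \sum_{k \notin \mathcal{T}_i} \alpha_k \right) && \text{(6a)} \\
&= \sum_{j \in \mathcal{P}_i} \sum_{m \in t_i(j)} \frac{\alpha_m}{n_i^m}
   + \sum_{k \notin \mathcal{T}_i} \alpha_k && \text{(6b)} \\
&= \sum_{(j, m) \in M_i} \frac{\alpha_m}{n_i^m}
   + \sum_{k \notin \mathcal{T}_i} \alpha_k && \text{(6c)} \\
&= \sum_{m \in \mathcal{T}_i} \sum_{j \in \mathcal{P}_i^m} \frac{\alpha_m}{n_i^m}
   + \sum_{k \notin \mathcal{T}_i} \alpha_k && \text{(6d)} \\
&= \sum_{m \in \mathcal{T}_i} \alpha_m
   + \sum_{k \notin \mathcal{T}_i} \alpha_k = 1. && \text{(6e)}
\end{aligned}
$$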

In (6a) we substitute the definition of the weight matrix entry. In (6b) we
expand the brackets and sum the second term over $j$, which cancels the factor
$1/n_i$. The double summation is equivalent to a summation over the set $M_i$
(6c). We can then group the elements of $M_i$ by their type (6d); each type
$m$ occurs in exactly $n_i^m$ links. Summing all the terms in (6e) and
applying (2) finishes the proof. □

Weighted PageRank in its general definition can produce almost any ranking of
the Web pages and actually has too much freedom; therefore, it is useful to
consider special instantiations of Weighted PageRank with a particular choice
of link type weights.

Example 1 PageRank is a particular case of Weighted PageRank.

Consider a graph and assume that there are two types of links. Let us assume
that every link of the graph is both of the first type and of the second type.
Let us consider the complement of the graph with respect to the fully
connected graph; the links belonging to the complement are assumed to be of
the second type only. Denote the weight of the first type by $c$ and the
weight of the second type by $(1 - c)$. Then the transition matrix written
according to the definition is the following:


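$$
\begin{aligned}
W_{ij} &= \frac{c}{n_i^1} + \frac{1 - c}{n_i^2},
  && \mathcal{P}_i^1 \ne \emptyset,\ j \in \mathcal{P}_i^1, \\
W_{ij} &= \frac{1 - c}{n_i^2},
  && \mathcal{P}_i^1 \ne \emptyset,\ j \in \mathcal{P}_i^2 \setminus \mathcal{P}_i^1, \\
W_{ij} &= \frac{1 - c}{n_i^2} + \frac{c}{n_i},
  && \mathcal{P}_i^1 = \emptyset,\ j \in \mathcal{P}_i^2.
\end{aligned}
$$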


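After some simplification one can get:

$$
\begin{aligned}
W_{ij} &= \frac{c}{n_i^1} + \frac{1 - c}{n},
  && \mathcal{P}_i^1 \ne \emptyset,\ j \in \mathcal{P}_i^1, \\
W_{ij} &= \frac{1 - c}{n},
  && \mathcal{P}_i^1 \ne \emptyset,\ j \in \mathcal{P}_i^2 \setminus \mathcal{P}_i^1, \\
W_{ij} &= \frac{1}{n},
  && \mathcal{P}_i^1 = \emptyset,\ \mathcal{P}_i^2 \ne \emptyset,\ j \in \mathcal{P}_i^2.
\end{aligned}
$$

Here $n_i^2 = n_i = n$, because every page has a link of the second type to
each of the $n$ pages. This is the Google matrix with damping factor $c$.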
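As a rough illustration of Example 1, the sketch below builds the simplified transition matrix (entries $c/n_i^1 + (1-c)/n$ for linked pages, $(1-c)/n$ for the remaining pages, and $1/n$ in the row of a page without links in the original graph) and approximates its stationary distribution by power iteration. The function names, the toy graph and the stopping tolerance are hypothetical choices, and the fully connected graph is assumed to include self-links so that every page has exactly $n$ links of the second type.

```python
def example1_matrix(n, out_links, c=0.85):
    """Simplified Example 1 matrix; out_links[i] = pages linked from page i."""
    W = []
    for i in range(n):
        targets = out_links.get(i, set())
        if not targets:
            # No first-type links: only the second type is present, so the
            # weight c of the missing type is spread over all n links.
            row = [1.0 / n] * n
        else:
            # Every page receives the second-type share (1 - c) / n ...
            row = [(1.0 - c) / n] * n
            # ... and linked pages additionally receive c / n_i^1.
            for j in targets:
                row[j] += c / len(targets)
        W.append(row)
    return W


def stationary(W, tol=1e-12, max_iter=1000):
    """Power iteration for the stationary row vector p with p = p W."""
    n = len(W)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        q = [sum(p[i] * W[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(p, q)) < tol:
            return q
        p = q
    return p


# Hypothetical toy graph with one page (3) that has no outgoing links.
out_links = {0: {1, 2}, 1: {2}, 2: {0}, 3: set()}
W = example1_matrix(4, out_links, c=0.85)
pagerank = stationary(W)
residual = max(
    abs(pagerank[j] - sum(pagerank[i] * W[i][j] for i in range(4)))
    for j in range(4)
)
```

The vector `pagerank` sums to one and is, up to the tolerance, a fixed point of the matrix, which is how the stationary distribution of the random surfer model is usually computed in practice.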


1.2  Weights  according  to  a  clustering 


Let us assume that we have a clustering $C$ of all the pages on the Web. It
does not matter how the clustering was obtained.


$$
C = \{C_1, C_2, \ldots, C_N\}, \qquad
\bigcup_{i=1}^{N} C_i = \mathcal{P}, \qquad
C_i \cap C_j = \emptyset, \quad i \ne j, \quad i, j = 1, \ldots, N.
$$


Given the clustering, we can distinguish two types of links: links lying
inside clusters, i.e. links between pages belonging to the same cluster (also
called intra-links), and links crossing cluster borders, i.e. links between
pages belonging to different clusters (also called inter-links). Let
$\alpha_k$, $k = 1, 2$, be the weights of these two types of links:


$$\alpha_k > 0, \quad k = 1, 2, \qquad \alpha_1 + \alpha_2 = 1.$$


Let us define the subindices in the following way: $k, m = 1, 2$, $k \ne m$.
The weighted transition matrix can be simplified in this case:


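$$
\begin{aligned}
W_{ij} &= \frac{\alpha_k}{n_i^k},
  && j \in \mathcal{P}_i^k,\ \mathcal{P}_i^m \ne \emptyset, \\
W_{ij} &= \frac{\alpha_k}{n_i^k} + \frac{\alpha_m}{n_i},
  && j \in \mathcal{P}_i^k,\ \mathcal{P}_i^m = \emptyset, \\
W_{ij} &= 0, && j \notin \mathcal{P}_i.
\end{aligned}
$$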
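The two-type case can be illustrated with a short Python sketch. The function name, the input format and the toy graph are assumptions made for the illustration; each link is monotypic (intra-link or inter-link), and when a page has links of only one type, the weight of the absent type is spread evenly over its $n_i$ links.

```python
def cluster_weighted_matrix(n, links, cluster_of, alpha_intra, alpha_inter):
    """Weighted matrix for intra-/inter-cluster links.

    links      -- iterable of (i, j) pairs, pages numbered 0..n-1
    cluster_of -- sequence mapping a page to its cluster id
    Rows of dangling pages are left as zeros.
    """
    out = {}
    for i, j in links:
        out.setdefault(i, set()).add(j)

    W = [[0.0] * n for _ in range(n)]
    for i, targets in out.items():
        n_i = len(targets)
        intra = {j for j in targets if cluster_of[j] == cluster_of[i]}
        inter = targets - intra
        for group, own, other_group, other in (
            (intra, alpha_intra, inter, alpha_inter),
            (inter, alpha_inter, intra, alpha_intra),
        ):
            # If the other link type is absent at page i, its weight is
            # distributed evenly over all n_i outgoing links.
            extra = other / n_i if not other_group else 0.0
            for j in group:
                W[i][j] = own / len(group) + extra
    return W


# Hypothetical example: pages 0, 1 form one cluster and pages 2, 3 another.
cluster_of = [0, 0, 1, 1]
links = [(0, 1), (0, 2), (0, 3), (1, 0), (2, 3), (3, 0), (3, 2)]
W = cluster_weighted_matrix(4, links, cluster_of,
                            alpha_intra=0.15, alpha_inter=0.85)
```

With these weights page 0 sends 0.15 of its probability mass along its single intra-link and 0.425 along each of its two inter-links, while page 1, which has only an intra-link, follows it with probability one.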
2 Experiments


2.1  TOmUW 


We calculated the usual cosine measure between the tf*idf vectors of documents
and the tf*idf vectors of topics. Both the query and narration fields were
used. For each query, the 1000 documents receiving the highest values were
reported. A non-zero value is obtained if a document and a topic have at least
one common term.
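A minimal sketch of this kind of run is given below. The paper does not state which tf*idf variant was used, so the weighting here (raw term frequency times a smoothed idf, $\log(1 + N/\mathrm{df})$, which keeps every weight positive) is an assumption, as are the function names and the three toy documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: dict name -> list of tokens. Returns (vectors, idf)."""
    n_docs = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    # Assumed smoothed idf; strictly positive for every term in the collection.
    idf = {t: math.log(1.0 + n_docs / d) for t, d in df.items()}
    vectors = {
        name: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for name, tokens in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_cosine(topic_tokens, vectors, idf, k=1000):
    """Return up to k (name, score) pairs with non-zero score, best first."""
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(topic_tokens).items()}
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    scored = [pair for pair in scored if pair[1] > 0.0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy collection and topic.
docs = {
    "d1": ["weighted", "pagerank", "links"],
    "d2": ["cluster", "links"],
    "d3": ["tf", "idf", "cosine"],
}
vectors, idf = tfidf_vectors(docs)
ranked = rank_by_cosine(["pagerank", "links"], vectors, idf)
names = [name for name, _ in ranked]
```

Document `d3` shares no term with the topic, receives a zero score and is therefore not reported, which mirrors the remark about non-zero values.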


2.2 AFvfl


We calculated PageRank for all the documents in the collection with a damping
factor equal to 0.85. For each query we took the documents having at least one
common term with the query and ordered them according to their PageRank. We
reported the first thousand of these documents, that is, those with the
highest PageRank values among the documents having at least one common term
with the query.
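The selection step can be sketched as a filter followed by a sort. The function name, the data layout and the toy scores below are hypothetical; `score` stands for any precomputed query-independent value per document, such as PageRank.

```python
def rank_by_score(query_terms, doc_terms, score, k=1000):
    """Documents sharing a term with the query, ordered by descending score."""
    query = set(query_terms)
    # Keep only the documents with at least one term in common with the query.
    candidates = [d for d, terms in doc_terms.items() if query & set(terms)]
    candidates.sort(key=lambda d: score[d], reverse=True)
    return candidates[:k]


# Hypothetical documents and scores.
doc_terms = {"a": ["web", "graph"], "b": ["web", "rank"], "c": ["cluster"]}
score = {"a": 0.2, "b": 0.5, "c": 0.3}
top = rank_by_score(["web"], doc_terms, score)
top_one = rank_by_score(["web"], doc_terms, score, k=1)
```

Note that the query only decides which documents are eligible; the order among eligible documents is independent of the query.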


2.3 Rkylv


We calculated Weighted PageRank for all the documents in the collection. For
each query we took the documents having at least one common term with the
query and ordered them according to their Weighted PageRank. We reported the
first thousand of these documents, that is, those with the highest Weighted
PageRank values among the documents having at least one common term with the
query.

The weights in Weighted PageRank were assigned according to a clustering. We
clustered all documents using the CDC clustering algorithm [1,3], which works
with the text content of documents and is intended to cluster documents
according to their topics. We gave different weights to links between
documents belonging to the same cluster (intra-links) and to links between
documents belonging to different clusters (inter-links). The weight of
intra-links is equal to 0.15, while the weight of inter-links is equal to
0.85. The weights were obtained by optimizing the number of relevant documents
in the first thousand according to the topics and relevance judgements of
TREC2007.
Fig. 1. AP measure on each topic with (left) and without (right) median results
over all systems. Topic ids are placed on the abscissa axis, and AP measure
values are placed on the ordinate axis.

2.4 ycbLS


We calculated the usual cosine measure between the tf*idf vectors of documents
and the tf*idf vectors of topics. Only the query field was used. For each
query, the 1000 documents receiving the highest values were reported. A
non-zero value is obtained if a document and a topic have at least one common
term.


3  Experimental  results 


The experimental results are presented in Table 1. The experiments do not
show an improvement of retrieval results from using Weighted PageRank with
cluster-related weights (experiment Rkylv) compared to PageRank (experiment
AFvfl).


Measure    TOmUW    AFvfl    Rkylv    ycbLS    Median   Best
infAP      0.1803   0.0099   0.0082   0.1879   0.2670   0.5541
infNDCG    0.3852   0.0535   0.0584   0.3785   0.4679   0.7803

Table 1

Average measures over all topics. Median is the average over all topics of
the median measure of all participating systems, and Best is the average over
all topics of the best result achieved among all participating systems.

Experimental results for each topic are presented in Fig. 1 and Fig. 2. One
can see from the left graphs of Fig. 1 and Fig. 2 that in some cases a
classical similarity measure between document and query, the cosine of tf*idf
representations (TOmUW), gives better results than the median achieved over
all systems.

Fig. 2. NDCG measure on each topic with (left) and without (right) median
results over all systems. Topic ids are placed on the abscissa axis, and NDCG
measure values are placed on the ordinate axis.


We cannot conclude from the right graphs of Fig. 1 and Fig. 2 that ordering
relevant documents according to Weighted PageRank gives better performance
than ordering by PageRank, and the values of the AP and NDCG measures in
Table 1 support this. Nevertheless, the AP measure for Weighted PageRank is
greater than that for PageRank in 48 cases out of 63, which is about 76%, and
the NDCG measure is greater in 42 cases out of 63, which is about 67%. We
performed the Kolmogorov-Smirnov test to check whether the AP measures for
PageRank and Weighted PageRank (or the NDCG measures for PageRank and Weighted
PageRank) follow the same distribution (the null hypothesis) and found that we
cannot reject the null hypothesis up to a significance level of 0.9865 for the
AP measure (0.9293 for the NDCG measure). This leads us to conclude that the
fact that the AP measure of Weighted PageRank is greater than the AP measure
of PageRank in the majority of cases is a chance result. One can say the same
about the NDCG measure.
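For readers who want to repeat this kind of check, a two-sample Kolmogorov-Smirnov test can be sketched as follows. The paper does not say which implementation was used; the version below computes the statistic $D$ exactly and the p-value with a commonly used asymptotic approximation that can be inaccurate for small samples. The function name and the sample values are hypothetical and are not the per-topic scores of the experiments.

```python
import math


def ks_two_sample(x, y):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the largest gap between the two empirical distribution functions;
    p is an asymptotic approximation of the p-value.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Advance both samples past every value <= v, then compare the ECDFs.
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    effective = n * m / (n + m)
    lam = (math.sqrt(effective) + 0.12 + 0.11 / math.sqrt(effective)) * d
    if lam < 0.1:
        # The Kolmogorov tail probability equals 1 to machine precision here.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))


# Hypothetical samples: identical ones and clearly separated ones.
d_same, p_same = ks_two_sample([1, 2, 3, 4], [1, 2, 3, 4])
d_far, p_far = ks_two_sample(list(range(10)), list(range(100, 110)))
```

A large p-value, as for the identical samples, means that the test gives no evidence that the two sets of values come from different distributions.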


The failure of link-based rankings such as PageRank and Weighted PageRank in
the experiments can be explained by the lack of recommendation links and the
prevalence of navigational links in the CSIRO dataset: being a centrally
governed organization, CSIRO tends to support a very small number of related
research projects, which makes it very different from the Web, where PageRank
is used with success.